
feat(seer): Add lightweight supergroups backfill task #112507

Merged
yuvmen merged 11 commits into master from yuvmen/feat/lightweight-rca-backfill
Apr 10, 2026

Conversation

Member

@yuvmen yuvmen commented Apr 8, 2026

Summary

  • Add an org-scoped Celery task (backfill_supergroups_lightweight_for_org) that iterates all error groups in an organization and sends each to Seer's lightweight RCA clustering endpoint for supergroup backfilling
  • Processes groups in batches of 50 with cursor-based pagination across (project_id, group_id), self-chaining until complete
  • Filters to error groups seen in last 90 days with unresolved substatus
  • Includes killswitch option for emergency stop
  • Designed to be triggered from a getsentry job (no API endpoint)
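The batching and self-chaining described above can be sketched as follows. This is a minimal, self-contained illustration under stated assumptions: `fetch_batch`, the in-memory dataset, and the synchronous recursion are stand-ins for the PR's actual Celery task and Seer calls.

```python
BATCH_SIZE = 50

# Fake dataset: sorted (project_id, group_id) pairs standing in for error groups.
ALL_GROUPS = sorted((p, g) for p in (1, 2) for g in range(1, 61))
sent = []


def fetch_batch(cursor):
    # Return the next BATCH_SIZE groups strictly after the cursor.
    remaining = ALL_GROUPS if cursor is None else [k for k in ALL_GROUPS if k > cursor]
    return remaining[:BATCH_SIZE]


def backfill_batch(cursor=None):
    batch = fetch_batch(cursor)
    sent.extend(batch)  # stand-in for posting each group to Seer
    if len(batch) == BATCH_SIZE:
        # Self-chain: the real task re-enqueues itself with a delay instead
        # of recursing synchronously.
        backfill_batch(cursor=batch[-1])


backfill_batch()
```

A short batch (fewer than 50 groups) signals the end of the org, so the chain stops naturally.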

Test plan

  • Unit tests for happy path, cross-project processing, self-chaining, killswitch, feature flag gating, failure handling, group filtering, and cursor resumption (10 tests)
  • Manual test with Sentry org (~6000 groups) via getsentry job

yuvmen and others added 2 commits April 8, 2026 09:51
Add RCASource enum and rca_source field to supergroup query requests
so Seer knows which embedding space to query. The source is determined
by the organizations:supergroups-lightweight-rca-clustering feature
flag.
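A minimal sketch of the idea in this commit; the enum's member names and values are hypothetical, since the PR page does not show them:

```python
import enum


class RCASource(enum.Enum):
    # Hypothetical members: one per embedding space Seer can query.
    STANDARD = "standard"
    LIGHTWEIGHT = "lightweight"


def rca_source_for_org(lightweight_flag_enabled: bool) -> RCASource:
    # Per the commit message, the feature flag determines the source.
    return RCASource.LIGHTWEIGHT if lightweight_flag_enabled else RCASource.STANDARD
```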

Replace the supergroups.lightweight-enabled-orgs sentry-option with
the feature flag for both the write path (post_process task dispatch)
and read path (supergroup query endpoints), consistent with how all
other supergroup features are gated.
Add an org-scoped Celery task that iterates all error groups in an
organization (seen in last 90 days) and sends each to Seer's
lightweight RCA clustering endpoint for supergroup backfilling.

The task processes groups in batches of 50 with cursor-based
pagination and self-chains until all groups are processed. Designed
to be triggered from a getsentry job.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Apr 8, 2026
Contributor

github-actions bot commented Apr 8, 2026

Backend Test Failures

Failures on 8c9c462 in this run:

tests/sentry/taskworker/test_config.py::test_all_instrumented_tasks_registeredlog
[gw0] linux -- Python 3.13.1 /home/runner/work/sentry/sentry/.venv/bin/python3
tests/sentry/taskworker/test_config.py:120: in test_all_instrumented_tasks_registered
    raise AssertionError(
E   AssertionError: Found 1 module(s) with @instrumented_task that are NOT registered in TASKWORKER_IMPORTS.
E   These tasks will not be discovered by the taskworker in production!
E   
E   Missing modules:
E     - sentry.tasks.seer.backfill_supergroups_lightweight
E   
E   Add these to TASKWORKER_IMPORTS in src/sentry/conf/server.py

yuvmen and others added 2 commits April 8, 2026 13:18
Add backfill_supergroups_lightweight to TASKWORKER_IMPORTS so the
task is discovered in production. Fix mypy errors by asserting
event.group is not None in tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ight-rca-backfill

# Conflicts:
#	src/sentry/features/temporary.py
#	src/sentry/options/defaults.py
#	src/sentry/seer/signed_seer_api.py
#	src/sentry/seer/supergroups/endpoints/organization_supergroup_details.py
#	src/sentry/seer/supergroups/endpoints/organization_supergroups_by_group.py
#	src/sentry/tasks/post_process.py
#	tests/sentry/seer/supergroups/endpoints/test_organization_supergroup_details.py
#	tests/sentry/seer/supergroups/endpoints/test_organization_supergroups_by_group.py
#	tests/sentry/tasks/test_post_process.py
Replace per-group get_latest_event() calls with batched Snuba queries
via bulk_snuba_queries for the event fetching phase. Uses a tight
timestamp window around each group's last_seen. Also reduces
inter-batch delay to 1s, rewrites cursor resumption test to verify
only post-cursor groups are processed, and adds exact batch boundary
edge case test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
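The "tight timestamp window around each group's last_seen" can be sketched as below. The exact padding is not shown in the PR, so the one-hour lower bound here is an assumption for illustration only:

```python
from datetime import datetime, timedelta, timezone


def snuba_window(last_seen: datetime) -> tuple[datetime, datetime]:
    # Narrow scan range bracketing the group's last_seen event, instead of
    # scanning the full 90-day backfill window (padding values are illustrative).
    return last_seen - timedelta(hours=1), last_seen + timedelta(minutes=1)


start, end = snuba_window(datetime(2026, 4, 8, 12, 0, tzinfo=timezone.utc))
```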
@yuvmen yuvmen marked this pull request as ready for review April 8, 2026 21:15
@yuvmen yuvmen requested review from a team as code owners April 8, 2026 21:15
Comment thread src/sentry/tasks/seer/backfill_supergroups_lightweight.py Outdated
Replace hardcoded substatus list with the canonical
UNRESOLVED_SUBSTATUS_CHOICES constant from sentry.types.group.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@cvxluo cvxluo left a comment


Looks generally good. I think we'll find some problems when we actually run the script, but we can resolve those as we go. My primary concerns are that we'll send requests to Seer too fast, that we'll get rate limited by Snuba, and that we can't make this idempotent.

Comment thread src/sentry/tasks/seer/backfill_supergroups_lightweight.py
Comment thread src/sentry/tasks/seer/backfill_supergroups_lightweight.py Outdated
Comment thread src/sentry/tasks/seer/backfill_supergroups_lightweight.py
for group in groups:
    # Use a tight window around the group's last_seen to minimize scan range,
    # falling back to the full backfill window if last_seen is unavailable
    group_start = group.last_seen - timedelta(hours=1) if group.last_seen else timestamp_start
Contributor


do we need this fallback? seems like the original backfill job did not do this

Member Author


Yeah, agreed. I didn't notice this got added, probably because of some test case. Will remove.

Comment thread tests/sentry/tasks/seer/test_backfill_supergroups_lightweight.py
Comment thread src/sentry/tasks/seer/backfill_supergroups_lightweight.py Outdated
Comment thread src/sentry/tasks/seer/backfill_supergroups_lightweight.py Outdated
Comment thread src/sentry/tasks/seer/backfill_supergroups_lightweight.py Outdated
)


def _batch_fetch_events(groups: Sequence[Group], organization_id: int) -> list[tuple[Group, dict]]:
Member


I think it's going to be pretty slow to make a query per group here. And you're probably also likely to start hitting Snuba rate limits.

Do you actually need the latest event for each group, or just any event? You could group by group_id, max(event_id) to just get some event id. I don't think Snuba supports window queries or anything, unfortunately.

Member Author


I think what I do here is the way they did it in V1, though things might have changed for sure. Right now I think I'm okay with naively taking any event; that might change though.
Interesting suggestion about grouping with max, I'll try it and see if it's fast.

Member

@wedamija wedamija Apr 9, 2026


Yeah, I think that at least this way you can send a batch of 100/1000/whatever groups in the same project and just get a result back. You could still split this into multiple queries as needed, but I think it'll be much faster if you can do on average one query per org (probably most orgs have fewer than 1k groups).
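The suggested group by group_id, max(event_id) reduction could look roughly like this in plain Python terms; the row shape is hypothetical, and in the real query Snuba would perform the aggregation itself:

```python
# Hypothetical rows from one Snuba query covering many groups at once.
rows = [
    {"group_id": 1, "event_id": "a1"},
    {"group_id": 2, "event_id": "b2"},
    {"group_id": 1, "event_id": "c3"},
]

one_event_per_group: dict[int, str] = {}
for row in rows:
    gid = row["group_id"]
    # max() semantics: keep the largest event id per group — i.e. "some event",
    # not necessarily the latest one, which is all the backfill needs.
    if gid not in one_event_per_group or row["event_id"] > one_event_per_group[gid]:
        one_event_per_group[gid] = row["event_id"]
```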

success_count = 0
viewer_context = SeerViewerContext(organization_id=organization_id)

for group, serialized_event in group_event_pairs:
Member


Should we add a threadpool here so that we can parallelize requests?

Member Author


Yeah, V1 grouping had it; this was again me trying to keep it simple, but maybe I'll just add it.

Member


I think it'll just be horribly slow without this. Ideally the API would just accept multiple groups, but if that's not worth the effort then at least using a threadpool speeds things up somewhat.

Member Author


I realized that on the Seer side we actually just queue a task and return, so it's going to be fast. We could probably add a way to batch-send to reduce the overhead of all the requests, but it's not like we are going to be waiting a ton of time on these, so I don't think it's that important right now.
As I mentioned to Mark, I will probably need to optimize more before I run this for all orgs; this is just a task to be able to do it for Sentry, and perhaps some more orgs, to test it out. Trying not to overcomplicate.

Member


The problem isn't the speed of the API on the other side; it's that you're waiting on IO on this side to get anything done. The task on the other side could complete in 0 seconds and it'd still result in this being much slower. This isn't blocking though, so I can approve.
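The threadpool idea under discussion can be sketched with a bounded `ThreadPoolExecutor`; `send_one` is a stand-in for the real blocking HTTP call to Seer, and the worker count is illustrative:

```python
from concurrent.futures import ThreadPoolExecutor


def send_one(group_id: int) -> int:
    # Stand-in for the blocking HTTP round trip to Seer
    return group_id


def send_all(group_ids, max_workers: int = 8) -> list[int]:
    # A bounded pool overlaps the IO waits without spawning unbounded threads,
    # addressing both the slowness and the "too fast for Seer" concerns.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order in its results
        return list(pool.map(send_one, group_ids))
```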

Member Author


Yeah, I understand. I am actually fine with this being slow; I am more worried about being too fast for Seer.

…reshold

Refactor to process one project at a time using the (project, status,
substatus, last_seen, id) composite index for efficient cursor
pagination at any scale. Add MAX_FAILURES_PER_BATCH=20 to stop
processing if Seer is consistently failing. Filter by
status=UNRESOLVED. Remove dead timestamp fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
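The MAX_FAILURES_PER_BATCH guard described in this commit can be sketched as follows; `process_batch` and `send` are illustrative names, not the task's actual helpers:

```python
MAX_FAILURES_PER_BATCH = 20


def process_batch(items, send) -> tuple[int, int]:
    successes = failures = 0
    for item in items:
        if send(item):
            successes += 1
        else:
            failures += 1
            if failures >= MAX_FAILURES_PER_BATCH:
                break  # circuit-break: Seer is likely down, stop hammering it
    return successes, failures
```

In the real task, hitting the threshold also stops the self-chain rather than just ending the current batch.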
Comment thread src/sentry/tasks/seer/backfill_supergroups_lightweight.py Outdated
Comment thread src/sentry/tasks/seer/backfill_supergroups_lightweight.py
Comment thread src/sentry/tasks/seer/backfill_supergroups_lightweight.py
Comment on lines +33 to +38
@instrumented_task(
    name="sentry.tasks.seer.backfill_supergroups_lightweight.backfill_supergroups_lightweight_for_org",
    namespace=seer_tasks,
    processing_deadline_duration=15 * 60,
)
def backfill_supergroups_lightweight_for_org(
Member


How will this task be scheduled/spawned? Will we be able to spawn them incrementally over time so that we don't generate a big backlog all at once that consumes all the worker capacity preventing other tasks from running?

Member Author


I currently plan to add a custom run job on getsentry to trigger this manually per org; in the future, when we plan to backfill everything, we will just have a loop over orgs. I don't plan to run this on multiple orgs in parallel for now.
When this was done for AI grouping v1, I believe we actually did it project by project and basically rate limited it, so it took a ton of time (months), but keeping the rate low meant we didn't overload worker capacity or bombard Seer.
Right now this task doesn't use any parallelism and just spawns the next batch after a batch is done, so I don't think it's capable of consuming all worker capacity for a single org. @wedamija commented on adding a threadpool; I am considering it so this task wouldn't be dead slow, but as you mention I will indeed need to make sure we don't spawn too many threads, both for Sentry's and Seer's sake.

This task is meant to be a tool to get a few orgs/projects backfilled and be able to POC this lightweight implementation. We will need to do more tweaking to be efficient when running it for everything.

project_id=project.id,
type=DEFAULT_TYPE_ID,
id__gt=last_group_id,
last_seen__gte=cutoff,
Member


fwiw I still think it's safer to remove this - you could also just filter out any groups outside this range on the python side, since they'll be a rare case.

Member Author


Yeah, sure, I don't feel strongly about it. The Snuba query will filter out anything it doesn't find events for anyway, and I don't really mind catching older groups; as long as we retained their events in Snuba, I don't really have to filter this.

Comment on lines +227 to +229
# Fetch full events from nodestore and serialize
group_event_pairs: list[tuple[Group, dict]] = []
for group, result in zip(groups, results):
Member


I think there's some bulk fetching stuff we can use here to make this a little faster

yuvmen added 2 commits April 10, 2026 14:49
- Track last_processed_group_id so early break on max failures doesn't
  skip unprocessed groups
- Stop self-chaining when max failures is reached to avoid hammering
  Seer when it's down
- Add project_id and last_processed_group_id to max failures log for
  easier resume
- Skip groups with failed event serialization instead of sending None
- Remove last_seen cutoff filter; old groups are naturally skipped
  when their events are gone from Snuba/nodestore
Use bind_nodes() for a single nodestore multi-get instead of 50
sequential get_event_by_id calls. Bulk serialize all events in one
serialize() call to batch get_attrs(). Cuts the event fetch phase
from ~3-5s to ~500ms per batch.
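The single multi-get pattern this commit describes can be illustrated with an in-memory stand-in; `get_multi` is a hypothetical signature used for the sketch, not the actual nodestore API:

```python
# In-memory stand-in for nodestore: one multi-get replaces 50 sequential reads.
store = {"n1": {"message": "boom"}, "n2": {"message": "crash"}}


def get_multi(node_ids):
    # Single round trip fetching every node id at once
    return {nid: store.get(nid) for nid in node_ids}


node_ids = ["n1", "n2", "n3"]
fetched = get_multi(node_ids)
# Skip groups whose event payload is gone, as the task does for failed
# serializations
events = [fetched[nid] for nid in node_ids if fetched[nid] is not None]
```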
Comment thread src/sentry/tasks/seer/backfill_supergroups_lightweight.py Outdated
Contributor

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit d45bf7e.

Comment thread src/sentry/tasks/seer/backfill_supergroups_lightweight.py
…ight-rca-backfill

# Conflicts:
#	src/sentry/conf/server.py
last_processed_group_id only tracks groups with Snuba events, so
eventless groups at the end of a batch would never be skipped,
causing an infinite re-fetch loop. Since we now return early on max
failures (no self-chain), groups[-1].id is safe for the cursor.
Comment thread src/sentry/tasks/seer/backfill_supergroups_lightweight.py
@yuvmen yuvmen enabled auto-merge (squash) April 10, 2026 22:48
@yuvmen yuvmen merged commit c0d6db4 into master Apr 10, 2026
77 checks passed
@yuvmen yuvmen deleted the yuvmen/feat/lightweight-rca-backfill branch April 10, 2026 22:49

Labels

Scope: Backend Automatically applied to PRs that change backend components


4 participants